---
id: tutorial-production-monitoring
title: Monitoring LLMs in Production
sidebar_label: Monitoring LLMs in Production
---

## Quick Summary

While we've thoroughly tested our medical chatbot, it's absolutely necessary to **continue monitoring and evaluating your LLM applications post-production**. This is crucial for improving your LLM application and identifying bugs as well as areas of improvement. Confident AI offers a complete suite of features to help you and your team easily monitor your LLMs in production, including:

- Online Evaluations
- LLM Tracing
- Integrating Human Feedback
- Placing Guardrails

In this section, we'll be focusing on setting up our medical chatbot application for response monitoring and tracing.

:::tip
For an **in-depth dive into LLM observability**, consider reading this detailed article on [LLM Monitoring and Observability](#https://www.confident-ai.com/blog/what-is-llm-observability-the-ultimate-llm-monitoring-guide).
:::

To get started with production monitoring, first login to Confident AI:

```bash
deepeval login
```

## Setting Up Monitoring

Let’s remind ourselves of the main interactive chat method within our `MedicalAppointmentSystem`. We’ll be enhancing this function to monitor real-time responses in production.

```python
class MedicalAppointmentSystem():
    ...
    def interactive_session(self):
        print("Welcome to the Medical Diagnosis and Booking System!")
        print("Please enter your symptoms or ask about appointment details.")

        while True:
            user_input = input("Your query: ")
            if user_input.lower() == 'exit':
                break

            response: AgentChatResponse = self.agent.chat(user_input)
            print("Agent Response:", response.response)
```

DeepEval's `monitor` function enables tracking of key information, including additional hyperparameters or custom data. This information is logged to **Confident AI's Observatory**, which features advanced filtering capabilities, making custom data logging especially impactful when applicable.

:::info
[Visit this section](https://documentation.confident-ai.com) to learn more about leveraging the full capabilities of `deepeval.monitor`.
:::

Let's see how we can use this `monitor` function in our `interactive_session` method:

```python
import deepeval
import time

class MedicalAppointmentSystem():
    ...
    def interactive_session(self):
        print("Welcome to the Medical Diagnosis and Booking System!")
        print("Please enter your symptoms or ask about appointment details.")

        while True:
            user_input = input("Your query: ")
            if user_input.lower() == 'exit':
                break

            start_time = time.time()
            response: AgentChatResponse = self.agent.chat(user_input)
            end_time = time.time()

            print("Agent Response:", response.response)

            # one Api call is all it takes to monitor
            deepeval.monitor(
                event_name="Medical Chatbot",
                model="gpt-4o",
                input=user_input,
                response=response.response,
                retrieval_context=[node.text for node in response.source_nodes],
                completion_time=end_time-start_time,
                # replace with ID of user interacting with your application,
                # leave blank if not available
                distinct_id="user1324",
                # replace with ID of conversation thread from your database,
                # leave blank if not available
                conversation_id="conversation123"
            )
```

In addition to logging mandatory parameters such as `event_name`, `model`, `input`, and `response`, we also log optional parameters such as `retrieval_context`, extracted from the agent response, along with execution time, a mock user ID, and a conversation ID.

:::tip  
If your use case involves a chatbot and is conversational, logging `conversation_id` allows you to analyze, evaluate, and view entire conversational threads.  
:::

## Setting Up Tracing

:::note
This is a bit more complex than native monitoring and so you can skip this and revisit this section later if you're in a hurry.
:::

Next, we’ll set up tracing for our medical chatbot. Since the medical diagnosis bot is built using LlamaIndex, integrating tracing is as simple as adding a **single line of code**. The same applies for LangChain-based LLM applications, making it extremely quick and easy to get started.

Monitoring allows you track your most important parameters, while [Tracing provides full visibility](https://www.confident-ai.com/blog/what-is-llm-observability-the-ultimate-llm-monitoring-guide) into the pathways and components your LLM application utilizes to generate a single response for each unique input.

:::caution
`deepeval.trace_llama_index()` automatically calls `monitor`, so you shouldn't call `monitor` while using tracing to avoid duplicate events. However, if you wish to trace your LLM application while utilizing `monitor` capabilities, hybrid tracing is required, which we’ll cover in Step 3.
:::

```python
deepeval.trace_llama_index()
```

Now, when you deploy your LLM application, you’ll gain full visibility into the steps it takes to generate each response, making it easier to debug and identify bottlenecks.

:::tip  
No matter how you built your application—whether using another framework or from scratch—you can create **custom (and even hybrid) traces** in DeepEval. Explore this [complete guide to LLM Tracing](https://documentation.confident-ai.com/observabilit/llm-tracing) to learn how to set it up.  
:::

## Tracing and Monitor

DeepEval allows you to control what you monitor while leveraging its tracing integrations. To enable hybrid tracing for production response monitoring, you need to call the `monitor()` method at the end of the root trace block.

:::info  
This method has the same arguments (and types) as `deepeval.monitor()`.  
:::

To get started, wrap your chat logic inside a `Tracer` block and replace `deepeval.monitor` with the `monitor` method of the `Tracer` instance. You'll also notice that we'll need to call `set_attributes()` on our instance to avoid any errors.

```python
from deepeval.tracing import Tracer, TraceType

class MedicalAppointmentSystem:
    ...
    def interactive_session(self):
        print("Welcome to the Medical Diagnosis and Booking System!")
        print("Please enter your symptoms or ask about appointment details.")

        while True:
            with Tracer(trace_type=TraceType.AGENT) as agent_trace:

                user_input = input("Your query: ")
                if user_input.lower() == 'exit':
                    break

                start_time = time.time()
                response: AgentChatResponse = self.agent.chat(user_input)
                end_time = time.time()

                agent_trace.monitor(
                    event_name="Medical Chatbot",
                    model="gpt-4o",
                    input=user_input,
                    response=response.response,
                    retrieval_context=[node.text for node in response.source_nodes],
                    completion_time=end_time-start_time,
                    distinct_id="user1324",
                    conversation_id="conversation123"
                )
                print("Agent Response:", response.response)
```

The LLM application is now fully set up for production monitoring and ready for deployment. In the next sections, we’ll explore how to make the most of Confident AI’s features for effective production monitoring.

## Example Interaction

Before diving into the platform, let’s first examine a _mock conversation_ between our medical chatbot and a hypothetical user, Jacob. Jacob is experiencing mild headaches and wants to schedule an appointment with a doctor to address his concerns. We’ll use the Observatory to analyze and review this conversation in the upcoming sections.

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://deepeval-docs.s3.amazonaws.com/tutorial_convo_1.png"
    style={{
      marginTop: "0px",
      marginBottom: "0px",
      height: "auto",
      maxHeight: "800px",
      borderBottomLeftRadius: "0px",
      borderBottomRightRadius: "0px",
    }}
  />
</div>

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://deepeval-docs.s3.amazonaws.com/tutorial_convo_2.png"
    style={{
      marginBottom: "20px",
      height: "auto",
      maxHeight: "800px",
      borderTopLeftRadius: "0px",
      borderTopRightRadius: "0px",
    }}
  />
</div>

## Eyeballing LLM Responses in Production

:::info
Remember how we eyeballed a few input-response pairs in the beginning when [defining our evaluation criteria?](/tutorials/tutorial-metrics-defining-an-evaluation-criteria#eyeballing-llm-responses) Well the idea here is similar. First you inspect some responses, before deciding what metrics you want to use for evaluation in production.

:::

In this section we'll be exploring how to leverage the Confident AI platform for monitoring in production. First, log in to **Confident AI** and navigate to the **Observatory**. Here, you can view and filter for responses, view entire conversational threads, leave feedback, and more.

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://deepeval-docs.s3.amazonaws.com/tutorial_monitoring_01.png"
    style={{
      marginTop: "20px",
      marginBottom: "20px",
      height: "auto",
      maxHeight: "800px",
    }}
  />
</div>

### Filtering

Let's start by filtering for all the responses from the conversation and model we want to analyze. We'll filter for the conversation thread set up earlier, `conversation_id="conversation111"`, and focus specifically on responses generated by the GPT-4o model.

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://deepeval-docs.s3.amazonaws.com/tutorial_monitoring_02.svg"
    style={{
      marginTop: "20px",
      marginBottom: "20px",
      height: "auto",
      maxHeight: "800px",
    }}
  />
</div>

:::tip  
In production, deploying multiple versions of your chatbot simultaneously enables (A/B) testing of different models, prompt templates, and configurations. By filtering responses based on these parameters, you can gain valuable insights and **pinpoint the most effective setup** for your specific use case.
:::

### Inspecting Each Response

To inspect each monitored response in more detail, simply click on the corresponding row. The fields displayed in the dropdown **align with the parameters we chose to log** in our `monitor` function. Since we did not log token count or token usage, and this specific response did not prompt our chatbot’s RAG engine tool, the fields for token usage, cost, and retrieval are empty.

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://deepeval-docs.s3.amazonaws.com/tutorial_monitoring_03.svg"
    alt="Inspecting Responses in Confident AI"
    style={{
      marginTop: "20px",
      marginBottom: "20px",
      height: "auto",
      maxHeight: "800px",
    }}
  />
</div>

:::note
You're also able to view **hyperparameters and custom data**, should you choose to log them, next to the _All Properties_ tab and click _Inspect_ to explore the test data in greater detail, including its traces.
:::

### Detailed Inspection

Clicking **Inspect** opens a side drawer where you can review your response in greater detail. You’ll find tabs for default parameters, hyperparameters, and custom data, as well as additional tabs for metrics and feedback.

:::info  
**Metrics and feedback** will be covered in the next tutorial as part of evaluating models in production.
:::

For now, let’s click on **View Trace** to see the full trace of the path our LLM application took for this response.

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://deepeval-docs.s3.amazonaws.com/tutorial_monitoring_04.svg"
    alt="Detailed Inspection in Confident AI"
    style={{
      marginTop: "20px",
      marginBottom: "20px",
      height: "auto",
      maxHeight: "800px",
    }}
  />
</div>

### Viewing Traces

We’ll examine the trace of this specific interaction between Jacob and our medical chatbot.

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://deepeval-docs.s3.amazonaws.com/tutorial_convo_2.png"
    alt="Detailed Inspection in Confident AI"
    style={{
      paddingBottom: "20px",
      height: "auto",
      maxHeight: "800px",
    }}
  />
</div>

As shown below, this response prompted the chatbot to call the RAG tool, initiating a complex workflow involving multiple LLMs, retrievers, and synthesizers.

<video
  width="100%"
  autoPlay
  loop
  muted
  playsInlines
  style={{
    paddingBottom: "20px",
    height: "auto",
    maxHeight: "800px",
  }}
>
  <source
    src="https://deepeval-docs.s3.us-east-1.amazonaws.com/tutorial_monitoring_05.mp4"
    type="video/mp4"
  />
</video>

For agent-based applications, these traces often vary depending on the input, as the chatbot dynamically chooses from multiple paths and tool calls to deliver the best response.

:::tip Reminder  
All of this was seamlessly automated with just a **single line of code**!  
:::
